---
title: Reading and Writing HDFS Parquet Data
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->

Use the PXF HDFS connector to read and write Parquet-format data. This section describes how to read and write HDFS files that are stored in Parquet format, including how to create, query, and insert into external tables that reference files in the HDFS data store.

PXF currently supports reading and writing primitive Parquet data types only.


## <a id="prereq"></a>Prerequisites

Ensure that you have met the PXF Hadoop [Prerequisites](access_hdfs.html#hadoop_prereq) before you attempt to read data from or write data to HDFS.


## <a id="datatype_map"></a>Data Type Mapping

To read and write Parquet primitive data types in Greenplum Database, map Parquet data values to Greenplum Database columns of the same type. 

Parquet supports a small set of primitive data types, and uses metadata annotations to extend the data types that it supports. These annotations specify how to interpret the primitive type. For example, Parquet stores both `INTEGER` and `DATE` types as the `INT32` primitive type. An annotation identifies the original type as a `DATE`.

### <a id="datatype_map_read "></a>Read Mapping

<a id="p2g_type_mapping_table"></a>

PXF uses the following data type mapping when reading Parquet data:

| Parquet Data Type | Original Type | PXF/Greenplum Data Type |
|-------------------|---------------|--------------------------|
| binary (byte_array) | Date | Date |
| binary (byte_array) | Timestamp_millis | Timestamp |
| binary (byte_array) | all others | Text |
| binary (byte_array) | -- | Bytea |
| boolean | -- | Boolean |
| double | -- | Float8 |
| fixed\_len\_byte\_array | -- | Numeric |
| float | -- | Real |
| int32 | Date | Date |
| int32 | Decimal | Numeric |
| int32 | int_8 | Smallint |
| int32 | int_16 | Smallint |
| int32 | -- | Integer |
| int64 | Decimal | Numeric |
| int64 | -- | Bigint |
| int96 | -- | Timestamp |

**Note**: PXF supports filter predicate pushdown on all parquet data types listed above, *except* the `fixed_len_byte_array` and `int96` types.

### <a id="datatype_map_Write "></a>Write Mapping

PXF uses the following data type mapping when writing Parquet data:

| PXF/Greenplum Data Type | Original Type | Parquet Data Type |
|-------------------|---------------|--------------------------|
| Boolean | -- | boolean |
| Bytea | -- | binary |
| Bigint | -- | int64 |
| SmallInt | int_16 | int32 |
| Integer | -- | int32 |
| Real | -- | float |
| Float8 | -- | double |
| Numeric/Decimal | Decimal | fixed\_len\_byte\_array |
| Timestamp<sup>1</sup> | -- | int96 |
| Timestamptz<sup>2</sup> | -- | int96 |
| Date | date | int32 |
| Time | utf8 | binary |
| Varchar | utf8 | binary |
| Text | utf8 | binary |
| OTHERS | -- | UNSUPPORTED |

</br><sup>1</sup>&nbsp;PXF localizes a <code>Timestamp</code> to the current system timezone and converts it to universal time (UTC) before finally converting to <code>int96</code>.
</br><sup>2</sup>&nbsp;PXF converts a <code>Timestamptz</code> to a UTC <code>timestamp</code> and then converts to <code>int96</code>. PXF loses the time zone information during this conversion.


## <a id="profile_cet"></a>Creating the External Table

The PXF HDFS connector `hdfs:parquet` profile supports reading and writing HDFS data in Parquet-format. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.

Use the following syntax to create a Greenplum Database external table that references an HDFS directory:

``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>
    ?PROFILE=hdfs:parquet[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export');
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```

The specific keywords and values used in the Greenplum Database [CREATE EXTERNAL TABLE](https://gpdb.docs.pivotal.io/latest/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html) command are described in the table below.

| Keyword  | Value |
|-------|-------------------------------------|
| \<path&#8209;to&#8209;hdfs&#8209;file\>    | The path to the directory in the HDFS data store. When the `<server_name>` configuration includes a [`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF considers \<path&#8209;to&#8209;hdfs&#8209;file\> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. \<path&#8209;to&#8209;hdfs&#8209;file\> must not specify a relative path nor include the dollar sign (`$`) character. |
| PROFILE    | The `PROFILE` keyword must specify `hdfs:parquet`. |
| SERVER=\<server_name\>    | The named server configuration that PXF uses to access the data. PXF uses the `default` server if not specified. |
| \<custom&#8209;option\>  | \<custom-option\>s are described below.|
| FORMAT 'CUSTOM' | Use `FORMAT` '`CUSTOM`' with `(FORMATTER='pxfwritable_export')` (write) or `(FORMATTER='pxfwritable_import')` (read). |
| DISTRIBUTED BY    | If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or \<column_name\> on both tables. Doing so will avoid extra motion of data between segments on the load operation. |

<a id="customopts"></a>
The PXF `hdfs:parquet` profile supports the following read option. You specify this option in the `CREATE EXTERNAL TABLE` `LOCATION` clause:

| Read Option  | Value Description |
|-------|-------------------------------------|
| IGNORE_MISSING_PATH | A Boolean value that specifies the action to take when \<path-to-hdfs-file\> is missing or invalid. The default value is `false`, PXF returns an error in this situation. When the value is `true`, PXF ignores missing path errors and returns an empty fragment. |

The PXF `hdfs:parquet` profile supports encoding- and compression-related write options. You specify these write options in the `CREATE WRITABLE EXTERNAL TABLE` `LOCATION` clause. The `hdfs:parquet` profile supports the following custom write options:

| Write Option  | Value Description |
|-------|-------------------------------------|
| COMPRESSION_CODEC    | The compression codec alias. Supported compression codecs for writing Parquet data include: `snappy`, `gzip`, `lzo`, and `uncompressed` . If this option is not provided, PXF compresses the data using `snappy` compression. |
| ROWGROUP_SIZE | A Parquet file consists of one or more row groups, a logical partitioning of the data into rows. `ROWGROUP_SIZE` identifies the size (in bytes) of the row group. The default row group size is `8 * 1024 * 1024` bytes. |
| PAGE_SIZE | A row group consists of column chunks that are divided up into pages. `PAGE_SIZE` is the size (in bytes) of such a page. The default page size is `1 * 1024 * 1024` bytes. |
| ENABLE\_DICTIONARY | A boolean value that specifies whether or not to enable dictionary encoding. The default value is `true`; dictionary encoding is enabled when PXF writes Parquet files. |
| DICTIONARY\_PAGE\_SIZE | When dictionary encoding is enabled, there is a single dictionary page per column, per row group. `DICTIONARY_PAGE_SIZE` is similar to `PAGE_SIZE`, but for the dictionary. The default dictionary page size is `1 * 1024 * 1024` bytes. |
| PARQUET_VERSION | The Parquet version; PXF supports the values `v1` and `v2` for this option. The default Parquet version is `v1`. |
| SCHEMA | The location of the Parquet schema file on the file system of the specified `SERVER`. |

**Note**: You must explicitly specify `uncompressed` if you do not want PXF to compress the data.

Parquet files that you write to HDFS with PXF have the following naming format: `<file>.<compress_extension>.parquet`, for example `1547061635-0000004417_0.gz.parquet`.

## <a id="parquet_write"></a> Example

This example utilizes the data schema introduced in [Example: Reading Text Data on HDFS](hdfs_text.html#profile_text_query).

| Column Name  | Data Type |
|-------|-------------------------------------|
| location | text |
| month | text |
| number\_of\_orders | int |
| total\_sales | float8 |

In this example, you create a Parquet-format writable external table that uses the default PXF server to reference Parquet-format data in HDFS, insert some data into the table, and then create a readable external table to read the data.

1. Use the `hdfs:parquet` profile to create a writable external table. For example:

    ``` sql
    postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet (location text, month text, number_of_orders int, total_sales double precision)
        LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
      FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
    ```

2. Write a few records to the `pxf_parquet` HDFS directory by inserting directly into the `pxf_tbl_parquet` table. For example:

    ``` sql
    postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Frankfurt', 'Mar', 777, 3956.98 );
    postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Cleveland', 'Oct', 3812, 96645.37 );
    ```

3. Recall that Greenplum Database does not support directly querying a writable external table. To read the data in `pxf_parquet`, create a readable external Greenplum Database referencing this HDFS directory:

    ``` sql
    postgres=# CREATE EXTERNAL TABLE read_pxf_parquet(location text, month text, number_of_orders int, total_sales double precision)
        LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
        FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
    ```

4. Query the readable external table `read_pxf_parquet`:

    ``` sql
    postgres=# SELECT * FROM read_pxf_parquet ORDER BY total_sales;
    ```

    ``` pre
     location  | month | number_of_orders | total_sales
    -----------+-------+------------------+-------------
     Frankfurt | Mar   |              777 |     3956.98
     Cleveland | Oct   |             3812 |     96645.4
    (2 rows)
    ```

